BioPars: A Pretrained Biomedical Large Language Model for Persian Biomedical Text Mining

Merzah, Baqer M., Taami, Tania, Asoudeh, Salman, Mirzaee, Saeed, pour, Amir reza Hossein, Bengari, Amir Ali

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have recently gained attention in the life sciences due to their capacity to model, extract, and apply complex biological information. Beyond their classical use as chatbots, these systems are increasingly used for complex analysis and problem-solving in specialized fields, including bioinformatics. First, we introduce BIOPARS-BENCH, a dataset from over 10,000 scientific articles, textbooks, and medical websites. BioParsQA was also introduced to evaluate the proposed model, which consists of 5,231 Persian medical questions and answers. This study then introduces BioPars, a simple but accurate measure designed to assess LLMs for three main abilities: acquiring subject-specific knowledge, interpreting and synthesizing such knowledge, and demonstrating proper evidence. Comparing ChatGPT, Llama, and Galactica, our study highlights their ability to remember and retrieve learned knowledge but also reveals shortcomings in addressing higher-level, real-world questions and fine-grained inferences. These findings indicate the need for further fine-tuning to address the capabilities of LLMs in bioinformatics tasks. To our knowledge, BioPars is the first application of LLMs in Persian medical QA, especially for generating long answers. Evaluation of four selected medical QA datasets shows that BioPars has achieved remarkable results compared to comparative approaches. The model on BioParsQA achieved a ROUGE-L score of 29.99, which is an improvement over GPT-4 1.0. The model achieved a BERTScore of 90.87 with the MMR method. The MoverScore and BLEURT values were also higher in this model than in the other three models. In addition, the reported scores for the model are MoverScore=60.43 and BLEURT=50.78. BioPars is an ongoing project and all resources related to its development will be made available via the following GitHub repository: https://github.com/amirap80/BioPars.
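For readers unfamiliar with the ROUGE-L metric cited above, the following is a minimal sketch of how an LCS-based F-score can be computed. It is not the evaluation code used by the BioPars authors: whitespace tokenization, the `beta` weighting, and the function names `lcs_length` and `rouge_l` are assumptions made for illustration, and Persian text would likely need a dedicated tokenizer. The sketch returns values in [0, 1], whereas the abstract appears to report scores on a 0 to 100 scale.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """LCS-based F-score between a generated answer and a reference answer.

    Assumption: tokens are obtained by whitespace splitting.
    """
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


if __name__ == "__main__":
    print(rouge_l("a b c d", "a c e"))
```

Because the score depends only on the longest common subsequence, it rewards answers that preserve the reference's word order without requiring contiguous matches, which is one reason it is commonly used for long-form answers.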


Unsupervised Sparse Coding-based Spiking Neural Network for Real-time Spike Sorting

Melot, Alexis, Wood, Sean U. N., Coffinier, Yannick, Yger, Pierre, Alibart, Fabien

arXiv.org Artificial Intelligence

Spike sorting is a crucial step in decoding multichannel extracellular neural signals, enabling the identification of individual neuronal activity. A key challenge in brain-machine interfaces (BMIs) is achieving real-time, low-power spike sorting at the edge while keeping high neural decoding performance. This study introduces the Neuromorphic Sparse Sorter (NSS), a compact two-layer spiking neural network optimized for efficient spike sorting. NSS leverages the Locally Competitive Algorithm (LCA) for sparse coding to extract relevant features from noisy events with reduced computational demands. NSS learns to sort detected spike waveforms in an online fashion and operates entirely unsupervised. To exploit multi-bit spike coding capabilities of neuromorphic platforms like Intel's Loihi 2, a custom neuron model was implemented, enabling flexible power-performance trade-offs via adjustable spike bit-widths. Evaluations on simulated and real-world tetrode signals with biological drift showed NSS outperformed established pipelines such as WaveClus3 and PCA+KMeans. With 2-bit graded spikes, NSS on Loihi 2 outperformed NSS implemented with leaky integrate-and-fire neurons and achieved an F1-score of 77% (+10% improvement) while consuming 8.6 mW (+1.65 mW) when tested on a drifting recording, with a computational processing time of 0.25 ms (+60 µs) per inference.
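As background on the Locally Competitive Algorithm mentioned above, here is a minimal sketch of the classical (non-spiking) LCA dynamics, in which each unit is driven by its match to the input and inhibited by active units with overlapping dictionary atoms. It is not the NSS implementation and does not model Loihi 2 or graded spikes. The function names, the Euler step size, the threshold `lam`, and the toy dictionary are all assumptions chosen for illustration; atoms are assumed to have unit norm.

```python
def soft_threshold(u, lam):
    """Map an internal state to a sparse coefficient."""
    if u > lam:
        return u - lam
    if u < -lam:
        return u + lam
    return 0.0


def dot(p, q):
    return sum(pi * qi for pi, qi in zip(p, q))


def lca(x, dictionary, lam=0.1, step=0.1, n_steps=200):
    """Sparse-code the input x over `dictionary` (a list of unit-norm atoms).

    Euler integration of: du/dt = b - u - (G - I) a, with a = soft_threshold(u).
    """
    n = len(dictionary)
    b = [dot(atom, x) for atom in dictionary]            # feed-forward drive
    gram = [[dot(dictionary[i], dictionary[j]) for j in range(n)]
            for i in range(n)]                           # atom overlaps
    u = [0.0] * n
    for _ in range(n_steps):
        a = [soft_threshold(ui, lam) for ui in u]
        u = [
            ui + step * (
                b[i] - ui
                - sum(gram[i][j] * a[j] for j in range(n) if j != i)  # lateral inhibition
            )
            for i, ui in enumerate(u)
        ]
    return [soft_threshold(ui, lam) for ui in u]


if __name__ == "__main__":
    # Toy waveform and a two-atom orthonormal dictionary (hypothetical values).
    print(lca([1.0, 0.05], [[1.0, 0.0], [0.0, 1.0]]))
```

In this toy case the second coefficient is driven below the threshold and stays exactly zero, which is the sparsity property that makes the resulting codes useful as compact features for sorting.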


Adversarial Sample Generation for Anomaly Detection in Industrial Control Systems

Mustafa, Abdul, Khan, Muhammad Talha, Umer, Muhammad Azmi, Masood, Zaki, Ahmed, Chuadhry Mujeeb

arXiv.org Artificial Intelligence

Machine learning (ML)-based intrusion detection systems (IDS) are vulnerable to adversarial attacks. It is crucial for an IDS to learn to recognize adversarial examples before malicious entities exploit them. In this paper, we generated adversarial samples using the Jacobian Saliency Map Attack (JSMA). We validate the generalization and scalability of the adversarial samples to tackle a broad range of real attacks on Industrial Control Systems (ICS). We evaluated the impact by assessing multiple attacks generated using the proposed method. The model trained with adversarial samples detected attacks with 95% accuracy on real-world attack data not used during training. The study was conducted using an operational secure water treatment (SWaT) testbed. Industrial control systems (ICS) comprise a significant portion of any state or nation's critical infrastructure (CI). Examples of such systems include water treatment plants and electric power grids, where an ICS regulates the physical processes. The physical processes consist of two primary parts: monitoring and controlling. The monitoring part maintains processes and ensures they are operating properly by measuring various signals acquired from sensors.


Effectively Steer LLM To Follow Preference via Building Confident Directions

Song, Bingqing, Han, Boran, Zhang, Shuai, Wang, Hao, Fang, Haoyang, Min, Bonan, Wang, Yuyang, Hong, Mingyi

arXiv.org Artificial Intelligence

Having an LLM that aligns with human preferences is essential for accommodating individual needs, such as maintaining writing style or generating specific topics of interest. The majority of current alignment methods rely on fine-tuning or prompting, which can be either costly or difficult to control. Model steering algorithms, which modify the model output by constructing specific steering directions, are typically easy to implement and optimization-free. However, their capabilities are typically limited to steering the model into one of the two directions (i.e., bidirectional steering), and there has been no theoretical understanding to guarantee their performance. In this work, we propose a theoretical framework to understand and quantify the model steering methods. Inspired by the framework, we propose a confident direction steering method (CONFST) that steers LLMs via modifying their activations at inference time. More specifically, CONFST builds a confident direction that is closely aligned with users' preferences, and this direction is then added to the activations of the LLMs to effectively steer the model output. Our approach offers three key advantages over popular bidirectional model steering methods: 1) It is more powerful, since multiple (i.e. more than two) users' preferences can be aligned simultaneously; 2) It is simple to implement, since there is no need to determine which layer to add the steering vector to; 3) No explicit user instruction is required. We validate our method on GPT-2 XL (1.5B), Mistral (7B) and Gemma-it (9B) models for tasks that require shifting the output of LLMs across various topics and styles, achieving superior performance over competing methods.
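To make the idea of steering via activations concrete, the sketch below shows generic activation addition: a unit direction is estimated from example activations and then added to a hidden state at inference time. This is not the CONFST procedure, whose construction of a "confident" direction is not described in the abstract. The mean-difference direction, the scale `alpha`, the function names, and the toy two-dimensional activations are assumptions for illustration only.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def unit_direction(preferred, other):
    """Normalized difference between the mean activation of preferred
    examples and the mean activation of other examples (an assumed,
    generic way to build a steering direction)."""
    diff = [p - o for p, o in zip(mean_vector(preferred), mean_vector(other))]
    norm = sum(d * d for d in diff) ** 0.5
    if norm == 0:
        raise ValueError("the two groups have identical mean activations")
    return [d / norm for d in diff]


def steer(activation, direction, alpha=1.0):
    """Add the scaled steering direction to a single activation vector."""
    return [h + alpha * d for h, d in zip(activation, direction)]


if __name__ == "__main__":
    # Toy activations (hypothetical values, not taken from any real model).
    direction = unit_direction([[1.0, 0.0], [3.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
    print(steer([0.5, 0.5], direction, alpha=2.0))
```

In a real model the same addition would be applied to hidden states inside the forward pass, and the choice of `alpha` trades steering strength against fluency.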


Network Tomography with Path-Centric Graph Neural Network

Hu, Yuntong, Wang, Junxiang, Zhao, Liang

arXiv.org Artificial Intelligence

Network tomography is a crucial problem in network monitoring, where the observable path performance metric values are used to infer the unobserved ones, making it essential for tasks such as route selection, fault diagnosis, and traffic control. However, most existing methods either assume complete knowledge of network topology and metric formulas, an unrealistic expectation in many real-world scenarios with limited observability, or rely entirely on black-box end-to-end models. To tackle this, in this paper, we argue that good network tomography requires synergizing the knowledge from both data and appropriate inductive bias from (partial) prior knowledge. To this end, we propose Deep Network Tomography (DeepNT), a novel framework that leverages a path-centric graph neural network to predict path performance metrics without relying on predefined hand-crafted metrics, assumptions, or the real network topology. The path-centric graph neural network learns the path embedding by inferring and aggregating the embeddings of the sequence of nodes that compose this path. Training path-centric graph neural networks requires learning the neural network parameters and network topology under discrete constraints induced by the observed path performance metrics, which motivates us to design a learning objective that imposes connectivity and sparsity constraints on topology and path performance triangle inequality on path performance. Extensive experiments on real-world and synthetic datasets demonstrate the superiority of DeepNT in predicting performance metrics and inferring graph topology compared to state-of-the-art methods.


BOLIMES: Boruta and LIME optiMized fEature Selection for Gene Expression Classification

Phan, Bich-Chung, Ma, Thanh, Nguyen, Huu-Hoa, Do, Thanh-Nghi

arXiv.org Artificial Intelligence

Gene expression classification is a pivotal yet challenging task in bioinformatics, primarily due to the high dimensionality of genomic data and the risk of overfitting. To bridge this gap, we propose BOLIMES, a novel feature selection algorithm designed to enhance gene expression classification by systematically refining the feature subset. Unlike conventional methods that rely solely on statistical ranking or classifier-specific selection, we integrate the robustness of Boruta with the interpretability of LIME, ensuring that only the most relevant and influential genes are retained. BOLIMES first employs Boruta to filter out non-informative genes by comparing each feature against its randomized counterpart, thus preserving valuable information. It then uses LIME to rank the remaining genes based on their local importance to the classifier. Finally, an iterative classification evaluation determines the optimal feature subset by selecting the number of genes that maximizes predictive accuracy. By combining exhaustive feature selection with interpretability-driven refinement, our solution effectively balances dimensionality reduction with high classification performance, offering a powerful solution for high-dimensional gene expression analysis.
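The first stage described above, comparing each feature against a randomized counterpart, can be sketched as follows. This is a simplified illustration of the shadow-feature idea and not the BOLIMES or Boruta implementation: real Boruta uses random forest importances and a statistical test over iterations, whereas this sketch substitutes absolute Pearson correlation with the label as a stand-in importance and a fixed hit-rate cutoff. The names `abs_corr` and `shadow_filter`, the parameters `n_trials` and `min_hit_rate`, and the toy data are assumptions.

```python
import random


def abs_corr(xs, ys):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / (vx * vy) ** 0.5


def shadow_filter(columns, labels, n_trials=20, min_hit_rate=0.9, seed=0):
    """Keep features that beat the best shuffled ("shadow") feature often enough.

    columns: list of feature columns, each a list of floats (one per sample).
    Returns the indices of the retained features.
    """
    rng = random.Random(seed)
    hits = [0] * len(columns)
    for _ in range(n_trials):
        shadow_scores = []
        for col in columns:
            shadow = col[:]
            rng.shuffle(shadow)               # destroy any link to the labels
            shadow_scores.append(abs_corr(shadow, labels))
        threshold = max(shadow_scores)        # best score achievable by chance
        for i, col in enumerate(columns):
            if abs_corr(col, labels) > threshold:
                hits[i] += 1
    return [i for i, h in enumerate(hits) if h >= min_hit_rate * n_trials]


if __name__ == "__main__":
    labels = [i % 2 for i in range(40)]
    informative = [float(y) for y in labels]           # toy "gene" tied to the class
    noise = [random.Random(1).random() for _ in range(40)]
    print(shadow_filter([informative, noise], labels))
```

The retained indices would then be passed to the later stages (LIME-based ranking and iterative evaluation), which are not sketched here.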


Enhancing Phishing Email Identification with Large Language Models

Lee, Catherine

arXiv.org Artificial Intelligence

Phishing has long been a common tactic used by cybercriminals and continues to pose a significant threat in today's digital world. As phishing attacks become more advanced and sophisticated, there is an increasing need for effective methods to detect and prevent them. To address the challenging problem of detecting phishing emails, researchers have developed numerous solutions, in particular those based on machine learning (ML) algorithms. In this work, we take steps to study the efficacy of large language models (LLMs) in detecting phishing emails. The experiments show that the LLM achieves a high accuracy rate at high precision; importantly, it also provides interpretable evidence for the decisions.


Multi-megabase scale genome interpretation with genetic language models

Träuble, Frederik, Stuart, Lachlan, Georgiou, Andreas, Notin, Pascal, Mehrjou, Arash, Schwessinger, Ron, Chevalley, Mathieu, Branson, Kim, Schölkopf, Bernhard, van Duijn, Cornelia, Marks, Debora, Schwab, Patrick

arXiv.org Artificial Intelligence

Understanding how molecular changes caused by genetic variation drive disease risk is crucial for deciphering disease mechanisms. However, interpreting genome sequences is challenging because of the vast size of the human genome, and because its consequences manifest across a wide range of cells, tissues and scales -- spanning from molecular to whole organism level. Here, we present Phenformer, a multi-scale genetic language model that learns to generate mechanistic hypotheses as to how differences in genome sequence lead to disease-relevant changes in expression across cell types and tissues directly from DNA sequences of up to 88 million base pairs. Using whole genome sequencing data from more than 150 000 individuals, we show that Phenformer generates mechanistic hypotheses about disease-relevant cell and tissue types that match literature better than existing state-of-the-art methods, while using only sequence data. Furthermore, disease risk predictors enriched by Phenformer show improved prediction performance and generalisation to diverse populations. Accurate multi-megabase scale interpretation of whole genomes without additional experimental data enables both a deeper understanding of molecular mechanisms involved in disease and improved disease risk prediction at the level of individuals.


SafeSynthDP: Leveraging Large Language Models for Privacy-Preserving Synthetic Data Generation Using Differential Privacy

Nahid, Md Mahadi Hasan, Hasan, Sadid Bin

arXiv.org Artificial Intelligence

Machine learning (ML) models frequently rely on training data that may include sensitive or personal information, raising substantial privacy concerns. Legislative frameworks such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) have necessitated the development of strategies that preserve privacy while maintaining the utility of data. In this paper, we investigate the capability of Large Language Models (LLMs) to generate synthetic datasets integrated with Differential Privacy (DP) mechanisms, thereby enabling data-driven research and model training without direct exposure of sensitive information. Our approach incorporates DP-based noise injection methods, including Laplace and Gaussian distributions, into the data generation process. We then evaluate the utility of these DP-enhanced synthetic datasets by comparing the performance of ML models trained on them against models trained on the original data. To substantiate privacy guarantees, we assess the resilience of the generated synthetic data to membership inference attacks and related threats. The experimental results demonstrate that integrating DP within LLM-driven synthetic data generation offers a viable balance between privacy protection and data utility. This study provides a foundational methodology and insight into the privacy-preserving capabilities of LLMs, paving the way for compliant and effective ML research and applications.
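As an illustration of the Laplace noise injection mentioned above, the sketch below applies the classical Laplace mechanism to a single numeric value, with noise scale equal to sensitivity divided by epsilon. It does not reproduce how SafeSynthDP integrates noise into LLM-driven generation, which the abstract does not specify. The function names, the default parameters, and the example value are assumptions; the noise is sampled as the difference of two exponential draws, which follows a Laplace distribution.

```python
import random


def laplace_scale(sensitivity, epsilon):
    """Noise scale b = sensitivity / epsilon for the Laplace mechanism."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def privatize(value, sensitivity=1.0, epsilon=0.5, rng=None):
    """Return value plus Laplace noise calibrated to (sensitivity, epsilon)."""
    rng = rng or random.Random()
    return value + laplace_noise(laplace_scale(sensitivity, epsilon), rng)


if __name__ == "__main__":
    rng = random.Random(0)
    true_count = 120.0  # hypothetical statistic derived from sensitive records
    print(privatize(true_count, sensitivity=1.0, epsilon=0.5, rng=rng))
```

Smaller values of `epsilon` give a larger noise scale and therefore stronger privacy at the cost of utility, which is the trade-off the abstract evaluates.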


Navigating Spatial Inequities in Freight Truck Crash Severity via Counterfactual Inference in Los Angeles

Wang, Yichen, Yin, Hao, Yang, Yifan, Zhao, Chenyang, Wang, Siqin

arXiv.org Artificial Intelligence

Freight truck-related crashes pose significant challenges, leading to substantial economic losses, injuries, and fatalities, with pronounced spatial disparities across different regions. This study adopts a transport geography perspective to examine spatial justice concerns by employing deep counterfactual inference models to analyze how socioeconomic disparities, road infrastructure, and environmental conditions influence the geographical distribution and severity of freight truck crashes. By integrating road network datasets, socioeconomic attributes, and crash records from the Los Angeles metropolitan area, this research provides a nuanced spatial analysis of how different communities are disproportionately impacted. The results reveal significant spatial disparities in crash severity across areas with varying population densities, income levels, and minority populations, highlighting the pivotal role of infrastructural and environmental improvements in mitigating these disparities. The findings offer insights into targeted, location-specific policy interventions, suggesting enhancements in road infrastructure, lighting, and traffic control systems, particularly in low-income and minority-concentrated areas. This research contributes to the literature on transport geography and spatial equity by providing data-driven insights into effective measures for reducing spatial injustices associated with freight truck-related crashes.